MG ITALS: QC and imputation

Start date: 04-15-2020

End date: 04-16-2020

Analysed by: Ruth Chia

Working directory: /data/NDRS_LNG/MyastheniaGravis/updated.April2020/Itals


Files needed as the starting point

  1. Raw genotype plink binaries
  2. Phenotype file with minimum sex and age information

What needs to be done

  1. Run sample- and variant-level QC on raw genotype plink files.

  2. Submit for imputation using Topmed Imputation Server (hg38).

    • Download imputed data, filter to keep only variants with Rsq > 0.3 and maf > 0.001
  3. Generate PCs and covariate file.

  4. Run GLM (dosages) using imputed data

Run QC

QC script is written in two parts i.e.

  • part1: performs sample- and variant-level QC
  • part2: prepares files in the correct format for upload to imputation server.

Sample-level QC includes:

STEP 1 : SAMPLE-LEVEL FILTERING (HETEROZYGOSITY, CALL RATE, GENDER CHECK, ANCESTRY)

STEP 1.1: Sample-level filtering (Heterozygosity check, removal of het outliers)

STEP 1.2: Sample-level filtering (sample call rate check, maximum missing calls per sample --mind 0.05)

STEP 1.3: Sample-level filtering (Gender check, remove samples that failed gender check)

STEP 1.4: Sample-level filtering (Ancestry check, remove samples that are not european)

STEP 1.5: Generate list of duplicates; remove duplicates

STEP 1.6: Generate list of related samples


Variant-level QC includes:

STEP 2 : VARIANT-LEVEL FILTERING (Variant inclusion criteria to account for genotyping batch differences; then make final bfiles with --geno 0.05)

STEP 2.1: Case/control nonrandom missingness test (i.e. exclude SNPs with P \< 1E-4)

STEP 2.2: Haplotype-based test for non-random missing genotype data (i.e. exclude SNPs with P \< 1E-4)

STEP 2.3: Make list of varaints that failed HWE (controls) (i.e. exclude SNPs with midP \< 1E-10)

STEP 2.4: Final --geno 0.05 to account for final variant missingness

Run QC part 1

In [ ]:
%%bash
echo "sh ./scripts/QC_preimputation_v2_keepRelated_part1of2.mg.sh merged1 Itals_mg chiarp@mail.nih.gov" > qc1.swarm

swarm --file qc1.swarm --gres=lscratch:200 --logdir swarmOE_qc --partition quick \
-g 64 -t auto --time 4:00:00 --sbatch "--mail-type=BEGIN,FAIL,TIME_LIMIT_80"

Create 'CLEAN' genotype files from QC (step1) files

  1. without duplicates, keepRelated
  2. without duplicates, UNRELATED

without duplicates, keepRelated

move QC-ed files to specific directory

In [ ]:
%%bash
module load plink/1.9.0-beta4.4

mkdir CLEAN.rawGenotype.keepRelated
mv FILTERED.Itals_mg_keepRelated.* ./CLEAN.rawGenotype.keepRelated

cd CLEAN.rawGenotype.keepRelated
mv FILTERED.Itals_mg_keepRelated.bed FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005.bed
mv FILTERED.Itals_mg_keepRelated.bim FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005.bim
mv FILTERED.Itals_mg_keepRelated.fam FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005.fam

without duplicates, UNRELATED

In [ ]:
%%bash
module load plink/1.9.0-beta4.4

mkdir CLEAN.rawGenotype.UNRELATED

plink \
--bfile CLEAN.rawGenotype.keepRelated/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005 \
--remove FILTERED.to.remove.GRM_matrix.noDups.relatedSamples_FIDspaceIID.txt \
--keep-allele-order \
--allow-no-sex \
--make-bed \
--out CLEAN.rawGenotype.UNRELATED/FILTERED.Itals_mg_noDups.UNRELATED.hwe1e-10.geno005

Run QC part 2

In [ ]:
%%bash
cp CLEAN.rawGenotype.keepRelated/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005.* .

echo "./scripts/QC_preimputation_v2_keepRelated_part2of2.sh FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005 chiarp@mail.nih.gov" > QC_preimputation_part2.swarm
swarm --file QC_preimputation_part2.swarm --gres=lscratch:200 \
--module plink,R -g 32 --time 24:00:00 -t auto \
--sbatch "--mail-type=BEGIN,FAIL,END,TIME_LIMIT_80"
In [ ]:
%%bash
# Remove some intermediate files that are not needed to save some space
rm FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr*.check.*
rm FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005X.*
rm FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005[0-9].*
rm FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005[0-9][0-9].*
rm FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-updated-chr*.*
rm *-FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-HRC.txt
rm Run-plink.sh
rm FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005.*

Generate PCs (with genotype data) --> create covariate file

To create covariate files for:

  1. RELATED+UNRELATED samples
  2. UNRELATED samples only
In [11]:
%%bash
module load plink/1.9.0-beta4.4
module load R/3.5.2
module load GCTA/1.92.0beta3
module load flashpca/2.0

mkdir PCAcovs

# Prune snps
plink \
--bfile CLEAN.rawGenotype.keepRelated/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005 \
--allow-no-sex \
--maf 0.01 \
--geno 0.01 \
--hwe 5e-6 \
--autosome \
--exclude range ./scripts/exclusion_regions_hg19.txt \
--make-bed \
--out PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter \
--memory 119500 --threads 19

plink \
--bfile PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter \
--allow-no-sex \
--geno 0.01 \
--maf 0.05 \
--indep-pairwise 1000 10 0.02 \
--out PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter_pruning \
--memory 119500 --threads 19

# All samples (related + unrelated samples; no dups)
plink \
--bfile PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter \
--allow-no-sex \
--extract PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter_pruning.prune.in \
--keep-allele-order \
--make-bed \
--out PCAcovs/pruned.Itals_mg_noDups.keepRelated \
--memory 119500 --threads 19

# Unrelated samples only (no dups)
plink \
--bfile PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter \
--remove GRM/FILTERED.to.remove.GRM_matrix.noDups.relatedSamples_FIDspaceIID.txt \
--allow-no-sex \
--extract PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter_pruning.prune.in \
--keep-allele-order \
--make-bed \
--out PCAcovs/pruned.Itals_mg_noDups.UNRELATED \
--memory 119500 --threads 19;

# Calculate/generate PCs based on pruned data set
cd PCAcovs
flashpca --bfile pruned.Itals_mg_noDups.keepRelated -d 10 --suffix _Itals_mg_noDups.keepRelated_forPCA --numthreads 19
flashpca --bfile pruned.Itals_mg_noDups.UNRELATED -d 10 --suffix _Itals_mg_noDups.UNRELATED_forPCA --numthreads 19

# Move all log files to a new folder
mkdir logFiles
mv *.log logFiles

# Remove intermediate files
rm FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter.bed
rm FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter.bim
rm FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter.fam
rm FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter_pruning.prune.in
rm FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter_pruning.prune.out
PLINK v1.90b4.4 64-bit (21 May 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter.log.
Options in effect:
  --allow-no-sex
  --autosome
  --bfile CLEAN.rawGenotype.keepRelated/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005
  --exclude range ./scripts/exclusion_regions_hg19.txt
  --geno 0.01
  --hwe 5e-6
  --maf 0.01
  --make-bed
  --memory 119500
  --out PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter
  --threads 19

257653 MB RAM detected; reserving 119500 MB for main workspace.
512030 out of 523703 variants loaded from .bim file.
3497 people (1914 males, 1583 females) loaded from .fam.
3497 phenotype values loaded from .fam.
--exclude range: 7558 variants excluded.
--exclude range: 504472 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3497 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
Total genotyping rate is 0.99796.
15317 variants removed due to missing genotype data (--geno).
--hwe: 106 variants removed due to Hardy-Weinberg exact test.
48648 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
440401 variants and 3497 people pass filters and QC.
Among remaining phenotypes, 924 are cases and 2573 are controls.
--make-bed to
PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter.bed +
PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter.bim +
PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter.fam ...
0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.
PLINK v1.90b4.4 64-bit (21 May 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter_pruning.log.
Options in effect:
  --allow-no-sex
  --bfile PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter
  --geno 0.01
  --indep-pairwise 1000 10 0.02
  --maf 0.05
  --memory 119500
  --out PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter_pruning
  --threads 19

257653 MB RAM detected; reserving 119500 MB for main workspace.
440401 variants loaded from .bim file.
3497 people (1914 males, 1583 females) loaded from .fam.
3497 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3497 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
Total genotyping rate is 0.998415.
0 variants removed due to missing genotype data (--geno).
34181 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
406220 variants and 3497 people pass filters and QC.
Among remaining phenotypes, 924 are cases and 2573 are controls.
Pruned 32611 variants from chromosome 1, leaving 1569.
Pruned 31861 variants from chromosome 2, leaving 1443.
Pruned 26675 variants from chromosome 3, leaving 1300.
Pruned 20816 variants from chromosome 4, leaving 1184.
Pruned 23152 variants from chromosome 5, leaving 1209.
Pruned 22505 variants from chromosome 6, leaving 1105.
Pruned 20688 variants from chromosome 7, leaving 1033.
Pruned 19940 variants from chromosome 8, leaving 905.
Pruned 18394 variants from chromosome 9, leaving 904.
Pruned 21983 variants from chromosome 10, leaving 997.
Pruned 20009 variants from chromosome 11, leaving 943.
Pruned 19716 variants from chromosome 12, leaving 956.
Pruned 14446 variants from chromosome 13, leaving 707.
Pruned 13117 variants from chromosome 14, leaving 683.
Pruned 12612 variants from chromosome 15, leaving 655.
Pruned 13439 variants from chromosome 16, leaving 726.
Pruned 11705 variants from chromosome 17, leaving 710.
Pruned 11870 variants from chromosome 18, leaving 682.
Pruned 7999 variants from chromosome 19, leaving 614.
Pruned 11187 variants from chromosome 20, leaving 617.
Pruned 5665 variants from chromosome 21, leaving 338.
Pruned 6175 variants from chromosome 22, leaving 375.
Pruning complete.  386565 of 406220 variants removed.
Marker lists written to
PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter_pruning.prune.in
and
PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter_pruning.prune.out
.
PLINK v1.90b4.4 64-bit (21 May 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to PCAcovs/pruned.Itals_mg_noDups.keepRelated.log.
Options in effect:
  --allow-no-sex
  --bfile PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter
  --extract PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter_pruning.prune.in
  --keep-allele-order
  --make-bed
  --memory 119500
  --out PCAcovs/pruned.Itals_mg_noDups.keepRelated
  --threads 19

257653 MB RAM detected; reserving 119500 MB for main workspace.
440401 variants loaded from .bim file.
3497 people (1914 males, 1583 females) loaded from .fam.
3497 phenotype values loaded from .fam.
--extract: 19655 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3497 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
Total genotyping rate is 0.998432.
19655 variants and 3497 people pass filters and QC.
Among remaining phenotypes, 924 are cases and 2573 are controls.
--make-bed to PCAcovs/pruned.Itals_mg_noDups.keepRelated.bed +
PCAcovs/pruned.Itals_mg_noDups.keepRelated.bim +
PCAcovs/pruned.Itals_mg_noDups.keepRelated.fam ... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.
PLINK v1.90b4.4 64-bit (21 May 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to PCAcovs/pruned.Itals_mg_noDups.UNRELATED.log.
Options in effect:
  --allow-no-sex
  --bfile PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter
  --extract PCAcovs/FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005_filter_pruning.prune.in
  --keep-allele-order
  --make-bed
  --memory 119500
  --out PCAcovs/pruned.Itals_mg_noDups.UNRELATED
  --remove GRM/FILTERED.to.remove.GRM_matrix.noDups.relatedSamples_FIDspaceIID.txt
  --threads 19

257653 MB RAM detected; reserving 119500 MB for main workspace.
440401 variants loaded from .bim file.
3497 people (1914 males, 1583 females) loaded from .fam.
3497 phenotype values loaded from .fam.
--extract: 19655 variants remaining.
--remove: 3397 people remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3397 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
Total genotyping rate in remaining samples is 0.998437.
19655 variants and 3397 people pass filters and QC.
Among remaining phenotypes, 919 are cases and 2478 are controls.
--make-bed to PCAcovs/pruned.Itals_mg_noDups.UNRELATED.bed +
PCAcovs/pruned.Itals_mg_noDups.UNRELATED.bim +
PCAcovs/pruned.Itals_mg_noDups.UNRELATED.fam ... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.
[Fri Apr 24 21:36:00 2020] arguments: flashpca flashpca --bfile pruned.Itals_mg_noDups.keepRelated -d 10 --suffix _Itals_mg_noDups.keepRelated_forPCA --numthreads 19 
[Fri Apr 24 21:36:00 2020] Start flashpca (version 2.0)
[Fri Apr 24 21:36:00 2020] seed: 1
[Fri Apr 24 21:36:00 2020] blocksize: 19655 (549868280 bytes per block)
[Fri Apr 24 21:36:00 2020] PCA begin
[Fri Apr 24 21:36:10 2020] PCA done
[Fri Apr 24 21:36:10 2020] Writing 10 eigenvalues to file eigenvalues_Itals_mg_noDups.keepRelated_forPCA
[Fri Apr 24 21:36:10 2020] Writing 10 eigenvectors to file eigenvectors_Itals_mg_noDups.keepRelated_forPCA
[Fri Apr 24 21:36:10 2020] Writing 10 PCs to file pcs_Itals_mg_noDups.keepRelated_forPCA
[Fri Apr 24 21:36:10 2020] Writing 10 proportion variance explained to file pve_Itals_mg_noDups.keepRelated_forPCA
[Fri Apr 24 21:36:10 2020] Goodbye!
[Fri Apr 24 21:36:10 2020] arguments: flashpca flashpca --bfile pruned.Itals_mg_noDups.UNRELATED -d 10 --suffix _Itals_mg_noDups.UNRELATED_forPCA --numthreads 19 
[Fri Apr 24 21:36:10 2020] Start flashpca (version 2.0)
[Fri Apr 24 21:36:10 2020] seed: 1
[Fri Apr 24 21:36:10 2020] blocksize: 19655 (534144280 bytes per block)
[Fri Apr 24 21:36:10 2020] PCA begin
[Fri Apr 24 21:36:32 2020] PCA done
[Fri Apr 24 21:36:32 2020] Writing 10 eigenvalues to file eigenvalues_Itals_mg_noDups.UNRELATED_forPCA
[Fri Apr 24 21:36:32 2020] Writing 10 eigenvectors to file eigenvectors_Itals_mg_noDups.UNRELATED_forPCA
[Fri Apr 24 21:36:32 2020] Writing 10 PCs to file pcs_Itals_mg_noDups.UNRELATED_forPCA
[Fri Apr 24 21:36:32 2020] Writing 10 proportion variance explained to file pve_Itals_mg_noDups.UNRELATED_forPCA
[Fri Apr 24 21:36:32 2020] Goodbye!
[+] Loading plink  1.9.0-beta4.4  on cn2121 
[+] Loading gcc  7.3.0  ... 
[+] Loading GSL 2.4 for GCC 7.2.0 ... 
[-] Unloading gcc  7.3.0  ... 
[+] Loading gcc  7.3.0  ... 
[+] Loading openmpi 3.0.2  for GCC 7.3.0 
[+] Loading ImageMagick  7.0.8  on cn2121 
[+] Loading HDF5  1.10.4 
[+] Loading pandoc  2.9.2.1  on cn2121 
[+] Loading R 3.5.2 
[+] Loading GCTA  1.92.0beta3 
[+] Loading flashpca  2.0  on cn2121 

Create covariate file

Use updated phenotype file as input for here: phenotype_mg_Carlo2018_2012_ItalsControls.updatedMarch2020.FINAL.txt. Then subset to only keep samples that made it to the 'CLEAN' plink1 files.

In [7]:
%%bash

module load R/3.5.2
R --vanilla --no-save

# Load libraries
require(data.table)
require(tidyverse)

# UNRELATED samples only (no dups)
## Read in files
pc <- fread("PCAcovs/pcs_Itals_mg_noDups.UNRELATED_forPCA", header=T)
fam <- fread("PCAcovs/pruned.Itals_mg_noDups.UNRELATED.fam", header=F)
fam <- fam %>% rename(FID = V1, IID = V2, MAT = V3, PAT = V4, SEX = V5, PHENO= V6)
pheno <- fread("phenotype_mg_Carlo2018_2012_ItalsControls.updatedMarch2020.FINAL.txt", header=T)
pheno <- pheno %>% rename(genderONFILE = gender)

dim(pc)
dim(fam)
dim(pheno)

## Merge pc, fam and pheno files. 
## Also create additional column so that the sample ID matches the format in the dose.vcf.gz (i.e. FID_IID)
data1 <- merge(fam,pheno,by=c("FID","IID"))
data2 <- merge(data1, pc, by=c("FID","IID"))
data2$GENDER <- data2$SEX - 1
data2$genderONFILE[is.na(data2$genderONFILE)] <- "NA"
data2$FID_IID <- paste(data2$FID,data2$IID, sep="_")
data2$age_at_onset <- as.numeric(data2$age_at_onset)
write.table(data2, "COVARIATES.Itals_mg_noDups.UNRELATED.txt", sep="\t",quote=F, col.names=T,row.names=F)

data3 <- data2 %>%
         mutate(FID = FID_IID, IID = FID_IID) %>%
         select(-FID_IID)
write.table(data3, "COVARIATES.Itals_mg_noDups.UNRELATED.forGLM.txt", sep="\t",quote=F, col.names=T,row.names=F)
head(data3)


## Summarise numbers by pheno or diagnosis and gender or sex
## Note that there is a mismatch of gender from pheno file compared to the actual SEX information in the plink file. 
## But since these samples survived gender check. 
## Decided to stick with information in the .fam file
table(data2$PHENO, data2$SEX)
table(data2$diagnosis, data2$SEX)
table(data2$diagnosis, data2$genderONFILE)
table(data2$diagnosis, data2$GENDER)

head(data2)

## Run step() to determine which covariate to correct for for association analysis
data2$CASE <- data2$PHENO - 1
unrelated <- glm(CASE ~ GENDER + age_at_onset + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 , data = data2, family = "binomial"(link = "logit"))
summary(unrelated)
step(unrelated)
R version 3.5.2 (2018-12-20) -- "Eggshell Igloo"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> 
> # Load libraries
> require(data.table)
> require(tidyverse)
> 
> # UNRELATED samples only (no dups)
> ## Read in files
> pc <- fread("PCAcovs/pcs_Itals_mg_noDups.UNRELATED_forPCA", header=T)
> fam <- fread("PCAcovs/pruned.Itals_mg_noDups.UNRELATED.fam", header=F)
> fam <- fam %>% rename(FID = V1, IID = V2, MAT = V3, PAT = V4, SEX = V5, PHENO= V6)
> pheno <- fread("phenotype_mg_Carlo2018_2012_ItalsControls.updatedMarch2020.FINAL.txt", header=T)
> pheno <- pheno %>% rename(genderONFILE = gender)
> 
> dim(pc)
[1] 3397   12
> dim(fam)
[1] 3397    6
> dim(pheno)
[1] 4253    5
> 
> ## Merge pc, fam and pheno files. 
> ## Also create additional column so that the sample ID matches the format in the dose.vcf.gz (i.e. FID_IID)
> data1 <- merge(fam,pheno,by=c("FID","IID"))
> data2 <- merge(data1, pc, by=c("FID","IID"))
> data2$GENDER <- data2$SEX - 1
> data2$genderONFILE[is.na(data2$genderONFILE)] <- "NA"
> data2$FID_IID <- paste(data2$FID,data2$IID, sep="_")
> data2$age_at_onset <- as.numeric(data2$age_at_onset)
> #write.table(data2, "COVARIATES.Itals_mg_noDups.UNRELATED.txt", sep="\t",quote=F, col.names=T,row.names=F)
> 
> data3 <- data2 %>%
+          mutate(FID = FID_IID, IID = FID_IID) %>%
+          select(-FID_IID)
> #write.table(data3, "COVARIATES.Itals_mg_noDups.UNRELATED.forGLM.txt", sep="\t",quote=F, col.names=T,row.names=F)
> head(data3)
                  FID                 IID MAT PAT SEX PHENO genderONFILE
1 10039-DVG_10039-DVG 10039-DVG_10039-DVG   0   0   2     1            1
2   10071-BR_10071-BR   10071-BR_10071-BR   0   0   2     1            1
3   10081-BC_10081-BC   10081-BC_10081-BC   0   0   1     1            0
4 10137-MMT_10137-MMT 10137-MMT_10137-MMT   0   0   2     1            1
5   10147-OR_10147-OR   10147-OR_10147-OR   0   0   1     1            0
6     1022-TS_1022-TS     1022-TS_1022-TS   0   0   1     1            0
  age_at_onset diagnosis          PC1         PC2          PC3          PC4
1           60   control -0.025312730  0.02164687  0.006001769  0.013145200
2           54   control -0.020662860  0.02438660 -0.041687700  0.044762190
3           75   control -0.013087100 -0.08244215 -0.014605790 -0.008541102
4           63   control -0.026193970 -0.01373366  0.012337870 -0.003966964
5           69   control -0.007320636 -0.04046117 -0.020739560 -0.010027520
6           55   control -0.042137960  0.02172641 -0.006119175 -0.027821780
           PC5          PC6          PC7         PC8          PC9          PC10
1 -0.025761550  0.012118420  0.026402570 -0.03195026  0.016432260  0.0231902000
2  0.027410460  0.013000710  0.017429430 -0.04442212  0.011062090  0.0091400870
3 -0.020152580 -0.008590607 -0.006106247 -0.02398282  0.005996563  0.0022551240
4  0.006837094 -0.049749590 -0.039859960  0.05736454 -0.037509020  0.0080890020
5  0.019149870  0.026729550  0.039740670 -0.02125327  0.020937230 -0.0004138881
6  0.015159360  0.011921870 -0.014626500 -0.01505404  0.058594910  0.0331456400
  GENDER
1      1
2      1
3      0
4      1
5      0
6      0
> 
> 
> ## Summarise numbers by pheno or diagnosis and gender or sex
> ## Note that there is a mismatch of gender from pheno file compared to the actual SEX information in the plink file. 
> ## But since these samples survived gender check. 
> ## Decided to stick with information in the .fam file
> table(data2$PHENO, data2$SEX)
   
       1    2
  1 1421 1057
  2  451  468
> table(data2$diagnosis, data2$SEX)
         
             1    2
  case     451  468
  control 1421 1057
> table(data2$diagnosis, data2$genderONFILE)
         
             0    1
  case     451  468
  control 1421 1057
> table(data2$diagnosis, data2$GENDER)
         
             0    1
  case     451  468
  control 1421 1057
> 
> head(data2)
         FID       IID MAT PAT SEX PHENO genderONFILE age_at_onset diagnosis
1: 10039-DVG 10039-DVG   0   0   2     1            1           60   control
2:  10071-BR  10071-BR   0   0   2     1            1           54   control
3:  10081-BC  10081-BC   0   0   1     1            0           75   control
4: 10137-MMT 10137-MMT   0   0   2     1            1           63   control
5:  10147-OR  10147-OR   0   0   1     1            0           69   control
6:   1022-TS   1022-TS   0   0   1     1            0           55   control
            PC1         PC2          PC3          PC4          PC5          PC6
1: -0.025312730  0.02164687  0.006001769  0.013145200 -0.025761550  0.012118420
2: -0.020662860  0.02438660 -0.041687700  0.044762190  0.027410460  0.013000710
3: -0.013087100 -0.08244215 -0.014605790 -0.008541102 -0.020152580 -0.008590607
4: -0.026193970 -0.01373366  0.012337870 -0.003966964  0.006837094 -0.049749590
5: -0.007320636 -0.04046117 -0.020739560 -0.010027520  0.019149870  0.026729550
6: -0.042137960  0.02172641 -0.006119175 -0.027821780  0.015159360  0.011921870
            PC7         PC8          PC9          PC10 GENDER
1:  0.026402570 -0.03195026  0.016432260  0.0231902000      1
2:  0.017429430 -0.04442212  0.011062090  0.0091400870      1
3: -0.006106247 -0.02398282  0.005996563  0.0022551240      0
4: -0.039859960  0.05736454 -0.037509020  0.0080890020      1
5:  0.039740670 -0.02125327  0.020937230 -0.0004138881      0
6: -0.014626500 -0.01505404  0.058594910  0.0331456400      0
               FID_IID
1: 10039-DVG_10039-DVG
2:   10071-BR_10071-BR
3:   10081-BC_10081-BC
4: 10137-MMT_10137-MMT
5:   10147-OR_10147-OR
6:     1022-TS_1022-TS
> 
> ## Run step() to determine which covariate to correct for for association analysis
> data2$CASE <- data2$PHENO - 1
> unrelated <- glm(CASE ~ GENDER + age_at_onset + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 , data = data2, family = "binomial"(link = "logit"))
> summary(unrelated)

Call:
glm(formula = CASE ~ GENDER + age_at_onset + PC1 + PC2 + PC3 + 
    PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10, family = binomial(link = "logit"), 
    data = data2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6420  -0.8706  -0.6327   1.1823   2.5921  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -0.949778   0.135406  -7.014 2.31e-12 ***
GENDER         0.462383   0.082378   5.613 1.99e-08 ***
age_at_onset  -0.008266   0.002428  -3.405 0.000662 ***
PC1          -15.832483   1.296204 -12.215  < 2e-16 ***
PC2           -2.099636   1.236276  -1.698 0.089441 .  
PC3          -10.034591   1.943518  -5.163 2.43e-07 ***
PC4           -3.497703   1.768651  -1.978 0.047973 *  
PC5            5.670583   1.728750   3.280 0.001037 ** 
PC6           -3.991650   1.721965  -2.318 0.020445 *  
PC7            0.616184   1.693712   0.364 0.716002    
PC8            1.018688   1.690342   0.603 0.546740    
PC9           -0.554474   1.701010  -0.326 0.744449    
PC10          -3.402406   1.695711  -2.006 0.044805 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3898.9  on 3321  degrees of freedom
Residual deviance: 3546.7  on 3309  degrees of freedom
  (75 observations deleted due to missingness)
AIC: 3572.7

Number of Fisher Scoring iterations: 5

> step(unrelated)
Start:  AIC=3572.66
CASE ~ GENDER + age_at_onset + PC1 + PC2 + PC3 + PC4 + PC5 + 
    PC6 + PC7 + PC8 + PC9 + PC10

               Df Deviance    AIC
- PC9           1   3546.8 3570.8
- PC7           1   3546.8 3570.8
- PC8           1   3547.0 3571.0
<none>              3546.7 3572.7
- PC2           1   3549.5 3573.5
- PC4           1   3550.7 3574.7
- PC10          1   3550.7 3574.7
- PC6           1   3552.1 3576.1
- PC5           1   3558.0 3582.0
- age_at_onset  1   3558.3 3582.3
- PC3           1   3576.1 3600.1
- GENDER        1   3578.3 3602.3
- PC1           1   3780.1 3804.1

Step:  AIC=3570.77
CASE ~ GENDER + age_at_onset + PC1 + PC2 + PC3 + PC4 + PC5 + 
    PC6 + PC7 + PC8 + PC10

               Df Deviance    AIC
- PC7           1   3546.9 3568.9
- PC8           1   3547.1 3569.1
<none>              3546.8 3570.8
- PC2           1   3549.6 3571.6
- PC4           1   3550.7 3572.7
- PC10          1   3550.8 3572.8
- PC6           1   3552.2 3574.2
- PC5           1   3558.1 3580.1
- age_at_onset  1   3558.3 3580.3
- PC3           1   3576.3 3598.3
- GENDER        1   3578.5 3600.5
- PC1           1   3780.3 3802.3

Step:  AIC=3568.91
CASE ~ GENDER + age_at_onset + PC1 + PC2 + PC3 + PC4 + PC5 + 
    PC6 + PC8 + PC10

               Df Deviance    AIC
- PC8           1   3547.3 3567.3
<none>              3546.9 3568.9
- PC2           1   3549.8 3569.8
- PC4           1   3550.9 3570.9
- PC10          1   3551.0 3571.0
- PC6           1   3552.3 3572.3
- PC5           1   3558.2 3578.2
- age_at_onset  1   3558.5 3578.5
- PC3           1   3576.4 3596.4
- GENDER        1   3578.7 3598.7
- PC1           1   3780.3 3800.3

Step:  AIC=3567.27
CASE ~ GENDER + age_at_onset + PC1 + PC2 + PC3 + PC4 + PC5 + 
    PC6 + PC10

               Df Deviance    AIC
<none>              3547.3 3567.3
- PC2           1   3550.1 3568.1
- PC4           1   3551.3 3569.3
- PC10          1   3551.4 3569.4
- PC6           1   3552.7 3570.7
- PC5           1   3558.5 3576.5
- age_at_onset  1   3558.9 3576.9
- PC3           1   3576.7 3594.7
- GENDER        1   3579.1 3597.1
- PC1           1   3780.8 3798.8

Call:  glm(formula = CASE ~ GENDER + age_at_onset + PC1 + PC2 + PC3 + 
    PC4 + PC5 + PC6 + PC10, family = binomial(link = "logit"), 
    data = data2)

Coefficients:
 (Intercept)        GENDER  age_at_onset           PC1           PC2  
   -0.951868      0.463687     -0.008232    -15.837941     -2.095995  
         PC3           PC4           PC5           PC6          PC10  
  -10.030466     -3.517386      5.637137     -4.008652     -3.423690  

Degrees of Freedom: 3321 Total (i.e. Null);  3312 Residual
  (75 observations deleted due to missingness)
Null Deviance:	    3899 
Residual Deviance: 3547 	AIC: 3567
> 
[-] Unloading gcc  9.2.0  ... 
[-] Unloading GSL 2.6 for GCC 9.2.0 ... 
[-] Unloading openmpi 4.0.5  for GCC 9.2.0 
[-] Unloading ImageMagick  7.0.8  on cn3454 
[-] Unloading HDF5  1.10.4 
[-] Unloading NetCDF 4.7.4_gcc9.2.0 
[-] Unloading pandoc  2.13  on cn3454 
[-] Unloading pcre2 10.21  ... 
[-] Unloading R 4.0.5 
[+] Loading gcc  7.3.0  ... 
[+] Loading GSL 2.4 for GCC 7.2.0 ... 
[-] Unloading gcc  7.3.0  ... 
[+] Loading gcc  7.3.0  ... 
[+] Loading openmpi 3.0.2  for GCC 7.3.0 
[+] Loading ImageMagick  7.0.8  on cn3454 
[+] Loading HDF5  1.10.4 
[+] Loading pandoc  2.13  on cn3454 
[+] Loading R 3.5.2 

The following have been reloaded with a version change:
  1) GSL/2.6_gcc-9.2.0 => GSL/2.4_gcc-7.2.0     3) gcc/9.2.0 => gcc/7.3.0
  2) R/4.0 => R/3.5.2

Loading required package: data.table
Loading required package: tidyverse
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.3.2     v purrr   0.3.4
v tibble  3.0.3     v dplyr   0.8.5
v tidyr   0.8.3     v stringr 1.4.0
v readr   1.3.1     v forcats 0.5.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::between()   masks data.table::between()
x dplyr::filter()    masks stats::filter()
x dplyr::first()     masks data.table::first()
x dplyr::lag()       masks stats::lag()
x dplyr::last()      masks data.table::last()
x purrr::transpose() masks data.table::transpose()

Results from step(); covariates to adjust for

UNRELATED samples only (no dups):

  • Lowest AIC = 3567.27
  • Covariates to adjust for: GENDER,age_at_onset,PC1,PC2,PC3,PC4,PC5,PC6,PC10

PC plots for cases and controls

In [14]:
%%bash
# Plot PCs to check to make sure cases and controls are not clustering out
module load R/3.5.2
Rscript ./scripts/PCplots_MG.GWAS.R COVARIATES.Itals_mg_noDups.keepRelated.txt PCplots_Itals_mg_noDups.keepRelated
Rscript ./scripts/PCplots_MG.GWAS.R COVARIATES.Itals_mg_noDups.UNRELATED.txt PCplots_Itals_mg_noDups.UNRELATED
[+] Loading gcc  7.3.0  ... 
[+] Loading GSL 2.4 for GCC 7.2.0 ... 
[-] Unloading gcc  7.3.0  ... 
[+] Loading gcc  7.3.0  ... 
[+] Loading openmpi 3.0.2  for GCC 7.3.0 
[+] Loading ImageMagick  7.0.8  on cn2121 
[+] Loading HDF5  1.10.4 
[+] Loading pandoc  2.9.2.1  on cn2121 
[+] Loading R 3.5.2 
Loading required package: data.table
Loading required package: tidyverse
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.3.0     v purrr   0.3.3
v tibble  2.1.3     v dplyr   0.8.5
v tidyr   0.8.3     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::between()   masks data.table::between()
x dplyr::filter()    masks stats::filter()
x dplyr::first()     masks data.table::first()
x dplyr::lag()       masks stats::lag()
x dplyr::last()      masks data.table::last()
x purrr::transpose() masks data.table::transpose()

Attaching package: 'gridExtra'

The following object is masked from 'package:dplyr':

    combine

There were 45 warnings (use warnings() to see them)
Loading required package: data.table
Loading required package: tidyverse
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.3.0     v purrr   0.3.3
v tibble  2.1.3     v dplyr   0.8.5
v tidyr   0.8.3     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::between()   masks data.table::between()
x dplyr::filter()    masks stats::filter()
x dplyr::first()     masks data.table::first()
x dplyr::lag()       masks stats::lag()
x dplyr::last()      masks data.table::last()
x purrr::transpose() masks data.table::transpose()

Attaching package: 'gridExtra'

The following object is masked from 'package:dplyr':

    combine

There were 17 warnings (use warnings() to see them)
In [2]:
from IPython.display import display
from PIL import Image

print("PC plots for samples and cases (UNRELATED samples only)")
PCplot="PCplots_Itals_mg_noDups.UNRELATED_prunedSNPs_hwe5e-6.geno001.maf001.jpg"
display(Image.open(PCplot))
PC plots for samples and cases (UNRELATED samples only)

Submitted files for imputation at MIS

Uploaded .vcf.gz files for imputation:

  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr1.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr2.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr3.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr4.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr5.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr6.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr7.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr8.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr9.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr10.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr11.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr12.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr13.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr14.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr15.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr16.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr17.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr18.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr19.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr20.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr21.vcf.gz
  • pre_impute_FILTERED.Itals_mg_noDups.keepRelated.hwe1e-10.geno005-chr22.vcf.gz
In [ ]:
%%bash
echo "sh scripts/MG.ITALS_ImputationResultsDownload.hg38_and_plink_04-28-2020.sh" > MG.ITALS_ImputationDownload.hg38.swarm
swarm --file MG.ITALS_ImputationDownload.hg38.swarm \
--logdir swarmOE_ImputationDownload \
--gres=lscratch:200 \
--module plink/1.9.0-beta4.4 \
-g 2 -t auto --time 24:00:00 \
--sbatch "--mail-type=BEGIN,FAIL,TIME_LIMIT_80"

With the output from minimac 4, unzipped files does not automatically create an index file. Do this manually.

In [ ]:
%%bash
cd Imputation.hg38
for CHNUM in {1..22};
do
echo "tabix -p vcf chr${CHNUM}.dose.vcf.gz" >> tabix_dose.vcf.gz.swarm
done
swarm --file tabix_dose.vcf.gz.swarm --gres=lscratch:400 \
--logdir swarmOE_tabix_dose.vcf.gz \
--module samtools -g 120 -t auto --time 72:00:00 \
--sbatch "--mail-type=BEGIN,FAIL,TIME_LIMIT_80"

Check total number of variants imputed

In [4]:
%%bash
DIR="/data/NDRS_LNG/MyastheniaGravis/updated.April2020/Itals"
cd $DIR/Imputation.hg38
wc -l chr[0-9]*.info
   23617900 chr1.info
   14212906 chr10.info
   14563388 chr11.info
   13986366 chr12.info
   10435997 chr13.info
    9280421 chr14.info
    8667566 chr15.info
    9840732 chr16.info
    8494641 chr17.info
    8253586 chr18.info
    6497881 chr19.info
   25543692 chr2.info
    6631675 chr20.info
    3829414 chr21.info
    4140642 chr22.info
   20864167 chr3.info
   20250718 chr4.info
   18909563 chr5.info
   17626672 chr6.info
   16990285 chr7.info
   16305107 chr8.info
   13114802 chr9.info
  292058121 total

Create list variants to keep for analysis based on Rsq > 0.3 and MAF > 0.0001

In [3]:
%%bash
mkdir Imputation.hg38/Vars.Rsq03MAF00001
cd Imputation.hg38

for CHNUM in {1..22} 
do
awk '{if($5 > 0.0001 && $7 > 0.3) print $1}' chr${CHNUM}.info | tail -n +2 > Vars.Rsq03MAF00001/Rsq03MAF00001.chr${CHNUM}.txt
done
In [6]:
!wc -l Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr[0-9]*.txt
  3472175 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr1.txt
  2167295 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr10.txt
  2157801 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr11.txt
  2098482 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr12.txt
  1577715 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr13.txt
  1385659 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr14.txt
  1261076 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr15.txt
  1378086 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr16.txt
  1214395 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr17.txt
  1228620 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr18.txt
   959626 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr19.txt
  3743334 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr2.txt
   973996 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr20.txt
   566318 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr21.txt
   584809 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr22.txt
  3132725 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr3.txt
  3032320 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr4.txt
  2844925 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr5.txt
  2749732 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr6.txt
  2525807 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr7.txt
  2450662 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr8.txt
  1924329 Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr9.txt
 43429887 total

What is needed:

  1. plink1 binaries containing variants with Rsq > 0.3, maf > 0.0001

Location of files:

  1. USmerged

    • merged imputed files: /data/NDRS_LNG/MyastheniaGravis/updated.April2020/US/Analysis.GLM.hg38/US.JointPostImputation/merged.vcf/US.chr${CHNUM}.vcf.gz
    • List of variants to keep: /data/NDRS_LNG/MyastheniaGravis/updated.April2020/US/Analysis.GLM.hg38/US.JointPostImputation/sharedVars.postImputation/SharedVars.postImputation.Rsq03MAF00001.chr${CHNUM}.txt
    • Covariate file: /data/NDRS_LNG/MyastheniaGravis/updated.April2020/US/COVARIATES.US_mg_noDups.UNRELATED.forImputed.txt
  2. Italy

    • Imputed files: /data/NDRS_LNG/MyastheniaGravis/updated.April2020/Itals/Imputation.hg38 /chr${CHNUM}.dose.vcf.gz
    • List of variants to keep: /data/NDRS_LNG/MyastheniaGravis/updated.April2020/Itals/Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr${CHNUM}.txt
    • Covariate file: /data/NDRS_LNG/MyastheniaGravis/updated.April2020/Itals/COVARIATES.Itals_mg_noDups.UNRELATED.forGLM.txt

Itals

In [ ]:
%%bash
cd Imputation.hg38
mkdir plinkForPRS

DIR="/data/NDRS_LNG/MyastheniaGravis/updated.April2020/Itals"

# Make list of unrelated samples to keep 
awk '{print $1,$2}' $DIR/COVARIATES.Itals_mg_noDups.UNRELATED.forGLM.txt > $DIR/Imputation.hg38/SampleList.Itals.UNRELATED.forImputed.txt

for CHNUM in {1..22}
do 
echo "sh vcfImputedToplink1.Itals.sh $DIR $CHNUM" >> generatePRSplink.swarm
done

swarm --file generatePRSplink.swarm --logdir swarmOE_vcfToPlink -g 120 \
--time 2:00:00 -t auto --module plink/1.9.0-beta4.4 --partition quick
In [8]:
%%bash
# view log file
cd Imputation.hg38/
cat swarmOE_vcfToPlink/swarm_10041423_0.o
---- COMMAND EXECUTED: ---------------------------------------------------------
(   sh vcfImputedToplink1.Itals.sh /data/NDRS_LNG/MyastheniaGravis/updated.April2020/Itals 1 )
--------------------------------------------------------------------------------
vcfImputedToplink1.Itals.sh running
This script should be executed in biowulf. If not, please terminate job.
PLINK v1.90b4.4 64-bit (21 May 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plinkForPRS/Itals.UNRELATED.Imputed.pass.Rsq03MAF00001.chr1.log.
Options in effect:
  --allow-no-sex
  --double-id
  --extract /data/NDRS_LNG/MyastheniaGravis/updated.April2020/Itals/Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr1.txt
  --keep /data/NDRS_LNG/MyastheniaGravis/updated.April2020/Itals/Imputation.hg38/SampleList.Itals.UNRELATED.forImputed.txt
  --keep-allele-order
  --make-bed
  --out plinkForPRS/Itals.UNRELATED.Imputed.pass.Rsq03MAF00001.chr1
  --pheno /data/NDRS_LNG/MyastheniaGravis/updated.April2020/Itals/COVARIATES.Itals_mg_noDups.UNRELATED.forGLM.txt
  --pheno-name PHENO
  --threads 32
  --vcf /data/NDRS_LNG/MyastheniaGravis/updated.April2020/Itals/Imputation.hg38/chr1.dose.vcf.gz

128639 MB RAM detected; reserving 64319 MB for main workspace.
--vcf:
plinkForPRS/Itals.UNRELATED.Imputed.pass.Rsq03MAF00001.chr1-temporary.bed +
plinkForPRS/Itals.UNRELATED.Imputed.pass.Rsq03MAF00001.chr1-temporary.bim +
plinkForPRS/Itals.UNRELATED.Imputed.pass.Rsq03MAF00001.chr1-temporary.fam
written.
23617900 variants loaded from .bim file.
3497 people (0 males, 0 females, 3497 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
plinkForPRS/Itals.UNRELATED.Imputed.pass.Rsq03MAF00001.chr1.nosex .
3397 phenotype values present after --pheno.
--extract: 3472175 variants remaining.
--keep: 3397 people remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3397 founders and 0 nonfounders present.
Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
3472175 variants and 3397 people pass filters and QC.
Among remaining phenotypes, 919 are cases and 2478 are controls.
--make-bed to plinkForPRS/Itals.UNRELATED.Imputed.pass.Rsq03MAF00001.chr1.bed +
plinkForPRS/Itals.UNRELATED.Imputed.pass.Rsq03MAF00001.chr1.bim +
plinkForPRS/Itals.UNRELATED.Imputed.pass.Rsq03MAF00001.chr1.fam ... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99%done.
In [ ]: